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Abstract — We present results from our cosmological N- 
body simulation which consisted of 2048x2048x2048 particles 
and ran distributed across three supercomputers throughout 
Europe. The run, which was performed as the concluding phase 
of the Gravitational Billion Body Problem DEISA project, 
integrated a 30 Mpc box of dark matter using an optimized 
Tree/Particle Mesh N-body integrator. We ran the simulation 
up to the present day (z=0), and obtained an efficiency of 
about 0.93 over 2048 cores compared to a single supercomputer 
run. In addition, we share our experiences on using multiple 
supercomputers for high performance computing and provide 
several recommendations for future projects. 

^Tejvforrfi -high-performance computing; distributed comput- 
ing; parallelization of applications; cosmology; N-body simu- 
lation 

I. Introduction 

Cosmological simulations are an efficient method to gain 
understanding of the formation of large-scale structures in 
the Universe. Large simulations were previously applied to 
model the evolution of dark matter in the Universe 0], 
and to investigate the properties of Milky-Way sized dark 
matter halos |2|, |3|. However, these simulations are com- 
putationally demanding, and are best run on large production 
infrastructures. We have previously run a cosmological sim- 
ulation using two supercomputers across the globe |4| with 
the GreeM integrator Q, @, and presented the SUSHI N- 
body integrator fT\, which we used to run simulations across 
up to four supercomputers. The simulations we ran in the 
Gravitational Billion Body Project produced over 110 TB 
of data, which we have used to characterize the properties 
of ultra-faint dwarf galaxies |8|, and to compare the halo 
mass function in our runs to analytical formulae for the mass 
function. Among other things, we found that the halo mass 
function in our runs shows good agreement with the Sheth 
and Tormen function |9^j down to ^ 10^ solar mass. 

Here we present the performance results of a production 
simulation across three supercomputers, as well as several 
other runs which all use an enhanced version of SUSHI. The 
production simulation ran continuously for ^ 8 hours, using 
2048 cores in total for calculations as well as 4 additional 
cores for communications. We achieved a peak performance 
of 3.31 X 10^^ tree force interactions per second, a sustained 
performance of2.19xl0^^ tree force interactions per second 
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and a wide area communication overhead of less than 10% 
overall. 

We briefly reflect on the improvements made to SUSHI 
for this work in Section 2, while we report on tests per- 
formed on a single supercomputer in Section 3. In Section 
4 we describe our experiments across three supercomputers 
and present our performance results. We reflect on our 
experiences on using multiple supercomputers for distributed 
supercomputing simulations, and provide several recommen- 
dations for users and resource providers in Section 4 and 
present our conclusions in Section 5. 

A. Related work 

There are a several other projects which have run high 
performance computing applications across multiple super- 
computers. These include simulations of a galaxy collision 
fTO|, a materials science problem flTl as well as an analysis 
application for arthropod evolution |12|. A larger number 
of groups performed distributed computing across sites of 
PCs rather than supercomputers (e.g., lfT3]| . |[14^, fTS^). 
Several software tools have been developed to facilitate high 
performance computing across sites of PCs (e.g., lfT6l . ifTTl . 
ri8l . ||19i . Ii20|) and within volatile computing environments 
The recently launched MAPPER EU-FP7 project |22] 
seeks to run multiscale applications across a distributed 
supercomputing environment, where individual subcodes 
periodically exchange information and (in some cases) run 
concurrently on different supercomputing architectures. 

II. Improvements to SUSHI 

Based on results of our earlier simulations and in prepa- 
ration for the production run across three supercomputers 
we made several modifications to the SUSHI distributed 
iV-body integrator. In our previous experiments a rela- 
tively large amount of computation and communication time 
was spent on (non-parallelize) particle-mesh integration. To 
reduce this bottleneck we now parallelized the particle- 
mesh integration routines using the parallel FFTW2 library 
1 23 1 and a one dimensional slab decomposition. We also 
optimized the communications of the particle-mesh integra- 
tion by introducing a scheme where sites only broadcast 
those mesh cells which have actual particle content. This 



Table I 

Initial condition and accuracy parameters used for our 
simulations with 2048^ particles. 



Parameter 


Value 


Matter density parameter (uio) 


0.3 


Cosmological constant (Aq) 


0.7 


Hubble constant {Ho) 


70.0 km/s/Mpc 


Mass fluctuation parameter (erg) 


0.8 


Box size 


(30Mpc)3 


Softening for 2048^ particle run. 


175 pc 


Sampling rate for 2048"^ particle run. 


20000 



optimization reduced the size of the mesh communications 
by a factor roughly equal to the number of sites used, in the 
case of an equal domain distribution. 

In some of the larger previous runs we also observed 
load imbalances if the code was run across two machines 
with different architectures, despite the presence of a load 
balancing scheme. This result has led us to further optimized 
the load balancing in SUSHI, taking into account not only 
the force integration time, but also the number of particles 
stored on each node. In addition to these changes, we also 
seized the opportunity to plug in a more recent MPWide 
|24J version into SUSHI. This newer version contains several 
optimizations to improve the wide area communication over 
networks with a high latency. 



III. Tests on a single site 



A. Setup 



We performed a number of runs on the Huygens super- 
computer to validate the scalability of our new implemen- 
tation, and to provide performance measurements against 
which we can compare our results using multiple sites. More 
information on the Huygens machine can be found in the 
second column of Tab. |lll] The initial conditions for this 
simulation is the snapshot at redshift z = 0.0026 from the 
CosmoGrid simulation (described in (4\). We also use the 
simulation parameters chosen for the CosmoGrid simula- 
tion, which are summarized in Tab. |l] Here the first four 
parameters are constants which are derived from WMAP 
observations (with a slight-roundoff) and the physical size 
of our simulated system is given by the fifth parameter 
(Box size). The softening in our simulation (i.e. a length 
value added to reduce the intensity of close interactions) 
and the sampling rate are given by the last two parameters. 
The sampling rate is the ratio of particles in the simulation 
divided by the number of particles sampled by the load 
balancing scheme. Our simulation used a mesh size of 512'^ 
cells. We ran the simulation using respectively 512 cores 
and 1024 cores until z — 0.0024, and using 2048 cores until 
the simulation completed (at z = 0). The number of force 
calculations per step in the simulation varies for different z 
values, though these variations are neglishible for z < 0.01. 



Table II 

Overview of experiments performed with the enhanced 
sushi code on the huygens supercomputer. the time spent 
on communication is given in the fourth column, while the 

total runtime is given in the fifth column. all times are 
measured per step, averaged over steps 1-1 1. in addition we 
included the timing results of the last 10 steps of the 
simulation running on 2048 cores (bottom row). 



N 


P 


e 


comm. t 


runtime 


2 range 


speedup 








[s] 


[s] 






2048^ 


512 


0.5 


19.18 


501.3 


2.5-2.4 


1 


2048^ 


1024 


0.5 


13.96 


258.2 


2.5-2.4 


1.94 


2048^ 


2048 


0.5 


22.34 


151.0 


2.5-2.4 


3.32 


2048^ 


2048 


0.5 


16.22 


143.7 


0.1-0.0 





B. Results 

The performance results of our runs are shown in Tab. [ll] 
In addition, the total runtime of the run using 2048 cores 
is given by the light blue line in Fig. |2] The overall 
performance of the code is dominated by calculations, with 
the communication overhead ranging from ^5% for 512 
cores to ^10-15% for 2048 cores. During the run using 
2048 cores, several snapshots were written. This resulted in 
a greatly increased execution time during two steps of the 
run. 

IV. Tests across three sites 

A. Setup 

We performed our main run using a total of 2048 cores 
across three supercomputers, which are listed in Tab. Ill 
These machines include Huygens in the Netherlands (1024 
cores), Louhi in Finland (512 cores), and HECToR in 
Scotland (512 cores). The sites are connected to the DEIS A 
shared network with either a IGbps interface (HECToR) or 
a lOGbps interface (Huygens, Louhi). The initial conditions 
and simulation parameters chosen are identical to those of 
the runs using 1 supercomputer, although we use a mesh 
of 256'^ cells. The use of a smaller mesh size results in 
a slightly higher calculation time as tree interactions are 
calculated over a longer range, but a somewhat lower time 
spent on intra-site communications. We configured MPWide 
to use 64 parallel tcp streams per path for the wide area 
communication channels, each with a tcp buffer size set 
at 768 kB and packet-pacing set at 10 MB/s maximum. 
We enabled some load balancing during the run, though 
we had to limit the boundary moving length per step to 
0.00001 of the box length due to memory constraints on 
our communication nodes and the presence of dense halos 
in our initial condition. 

In addition to the main run, we also performed three 
smaller runs using the same code across the same three 
supercomputers. These include one run with 1024^ particles 
using 80 cores per supercomputer, and two runs with 512^ 
particles using 40 cores per supercomputer. These runs 
also used a mesh size of 256"^, though we did reduce the 



Table in 

Properties of the three supercomputers used for our run. 
The measured peak number of tree force interactions (in 
millions) per second per core is given for each site in the 

BOTTOM row. 



Name 


Huygens 


Loulii 


HECToR 


Location 


Amsterdam 


Espoo 


Edinburgh 


Vendor 


IBM 


Cray 


Cray 


Architecture 


Power6 


XT4 


XT4 


# of cores 


3328 


4048 


12288 


CPU [GHz] 


4.7 


2.3 


2.3 


RAM / core [GB] 


4/8 


1/2 


2 


force calcs. / core [Mints/s] 


185 


256 


250 



sampling rate to respectively 10000 and 5000 for the runs 
with 1024'^ and 512^ particles. The force softening used for 
these runs were respectively 1.25kpc and 2.5kpc, and we set 
the boundary moving length limit to 0.01 of the box length. 
Some of the measurements were made using an opening 
angle 9 of 0.3, rather than 0.5. Using a smaller opening 
angle results a higher accuracy of the force integration on 
close range, but also results in a higher force calculation and 
tree structure communication time per step. 

B. Results 

The timing results of our production run are shown in 
Fig.[T] Here, we also added the wall-clock time results of the 
simulation run using 2048 cores on Huygens as reference. 
The simulation run across three sites is only ^ 9% slower 
per step than the single-site run, despite the slightly higher 
force calculation time due to the lower number of mesh 
cells. The peaks in wall-clock time of the single site run are 
caused by the writing of snapshots during those steps (we 
only wrote one snapshot at the end of the three site run). 
The total wide area communication overhead of our run is 
^ 10% at about 15s per step. Most of this time is required 
to exchange the tree structures between sites, though the 
communications for the parallelized particle-mesh require an 
additional ~' 2.5s per step. Despite the use of a shared wide 
area network, the communication performance of our run 
shows very little jitter and no large slowdowns. We provide 
a snapshot of the final state of the simulation (at z — 0), 
distributed across the three supercomputers, in Fig. [2] 

We also provide a numerical overview of the production 
run performance, as well as that of several other runs which 
use the new code, in Tab|IV| The communication overhead 
for the runs with 512'' particles is less than 20%, while 
the overhead for the run with 1024"^ particles is just 6.5%. 
The parallelization of the particle-mesh integration and the 
enhanced load balancing greatly improved the performance 
of these runs, especially in the case with 1024'^ particles. 
Here, the communication overhead was reduced by ~ 60% 
and the overall runtime by more than 25% compared to the 
previous version t7J. 



Table IV 

Overview of experiments performed with the enhanced 
sushi code across all three supercomputers. all times are 
measured per step, averaged over 10 steps. 
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total 






512^ 


120 


0.3 


6.925 


7.312 


39.70 


11.8-10.1 


512^ 


120 


0.5 


5.982 


6.335 


24.60 


9.9-8.8 


10243 


240 


0.3 


12.09 


14.04 


214.5 


17.0-14.9 


2048^ 


2048 


0.5 


15.40 


24.77 


167.7 


0.0026-0.0025 


2048^ 


2048 


0.5 


14.62 


23.13 


155.2 


0.0001-0 



V. User Experiences 

We have presented results from several cosmological sim- 
ulations which run across three supercomputers, including 
a production run lasting for 8 hours. In the process of 
seeking a solution for wide area message passing between 
supercomputers, requesting allocations, arranging network 
paths and preparing for the execution of these simulations, 
we have learned a number of valuable lessons. 

Primarily, we found that it is structurally possible to do 
high performace computing across multiple supercomputers. 
During the GBBP project we have run a considerable 
number of large-scale simulations using two or more super- 
computers, with results improving as we were able to further 
enhance the A^-body integrator and optimize the MPWide 
communication library for the wide area networks that we 
used. 

The cooperation of the resource providers was particularly 
crucial in this project, as they enabled previously unavailable 
network paths and provided us with means to initiate simu- 
lations concurrently at the different sites. However, reserv- 
ing networks and orchestrating concurrent supercomputer 
runs currently does require a disproportionate amount of 
time and effort, which makes performance optimization and 
debugging a challenging task. The effort required to run 
applications across supercomputers can be greatly reduced 
if resource providers were to adopt automated resource 
reservation systems for their supercomputers, and maintain 
shared high-bandwidth networking between sites. The per- 
sistent DEISA shared network connections helped greatly in 
our case, as we could use it at will without prior network 
reservations. 

The software environment across different supercomput- 
ers, even within the same distributed infrastructure, is very 
heterogeneous. This made it unattractive to use existing 
middleware or message passing implementations to make 
different sites interoperable. We chose to use a modular 
approach where we connected platform-specific optimized 
versions of the SUSHI code with the MPWide communi- 
cation library. With MPWide being a user-space tool that 
requires no external libraries or administrative privileges, 
we are able to install and run the simulation code in the 
locally preferred software environments on each site without 



needing any additional (grid) middleware. We recommend 
adopting a similar modular software approach in future 
distributed supercomputing efforts for its ease of installation 
and optimization, at least until resource providers present 
a homogeneous and interoperable software environment for 
distributed supercomputing. 

This paper focuses on the calculation and communica- 
tion performance aspects of a single application run across 
supercomputers. However, the methods presented here can 
be applied for several other purposes. During this project we 
were confronted with additional overhead introduced by disk 
I/O, as can be observed in Figure [T] With supercomputer disk 
performance and capacity improving at a much slower rate 
than the compute power, the deployment of an application 
across sites may help to eliminate a disk I/O performance 
bottleneck, though a detailed investigation will be needed 
to quantify such potential benefit. Additionally, the com- 
munication technique could be used to facilitate periodic 
exchanges between different simulation codes, each of which 
runs on a different site and tackles a different aspect of a 
complex multiscale or multiphysics problem. 

VI. Conclusion 

Our results show that cosmological production simu- 
lations run efficiently across supercomputers for a pro- 
longed time. The political effort required to arrange cross- 
supercomputer runs is considerable, and is an important 
reason why few people have attempted to run production 
simulations across supercomputers. We have shown that 
the added overhead of using a network of supercomputers 
is rather marginal for at least one optimized production 
application and that given the right (political) environment, 
supercomputers can be conveniently connected to form even 
larger high performance computing resources. 
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